
    Islandora: Creating and Sustaining an Open Source Community

    Three years have passed since the formation of the Islandora Foundation was announced at Open Repositories 2013. Since that time, the project has welcomed more than two dozen supporting institutions, hosted Islandora Camps all over the world, and completed four fully community-driven software releases with dozens of new modules built and contributed by the Islandora community. The Islandora project has made the journey from a grant-funded project incubated in a university library to a vibrant, global community facilitated by a non-profit that exists only by symbiosis with the community it serves. This presentation will provide a general overview of that journey and of the current status of the Islandora project (including Islandora CLAW) and its community. As the Islandora Foundation enters its fourth year, with a staff of two and a truly community-driven development process, membership in the Islandora Foundation provides the shared governance structure that allows for a sustainable open source repository platform for the GLAM (galleries, libraries, archives, and museums) community.

    Capturing the Web Today for Tomorrow: Innovations in capturing and analyzing social media and websites for the new scholarly record

    The growth of digital sources since the advent of the World Wide Web in 1991, and the commencement of widespread web archiving in 1996, presents profound new opportunities for social and cultural analysis. In simple terms, the 1990s cannot be studied without web archives: they are both primary sources that reflect how people consume and understand media, as well as repositories that document the thoughts, opinions, and activities of millions of everyday people. These are a dream for social historians. However, all of this opportunity brings challenges. The size and complexity of the data requires interdisciplinary collaboration. Historians might have the research questions but not the technical resources or knowledge to work with these sources, requiring outreach to other disciplines. Libraries and archives are perfectly positioned to work in this new emerging field that brings together historians, computer scientists, and information specialists. In this talk, our speakers will discuss the fruits of one collaboration that has emerged at York University, the University of Alberta, and the University of Waterloo. Bringing together librarians, archivists, historians, and computer scientists, as well as an interdisciplinary team of undergraduate and graduate students, this distributed group is developing several web archival analytics projects. They work using a combination of centralized and de-centralized infrastructure to run data analytics, store web archives, provide a publicly-facing portal, and collaborate. Ian and Nick will discuss the challenges of working in an interdisciplinary environment, and give insights into how the team has been working through in-detail case studies of their work with webarchives.ca, Twitter archiving and analysis, Compute Canada, and Warcbase, a web analytics platform. 
Collaboration between computer scientists, librarians, archivists, and humanists is not always simple, but it is a collaboration worth pursuing. York University Libraries

    The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

    The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building, all proceeding concurrently in mutually reinforcing efforts. As we near the end of our initially conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives. This research was supported by the Andrew W. Mellon Foundation, the Social Sciences and Humanities Research Council of Canada, as well as Start Smart Labs, Compute Canada, the University of Waterloo, and York University. We’d like to thank Jeremy Wiebe, Ryan Deschamps, and Gursimran Singh for their contributions.
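The four activities named in the abstract above (filter, extract, aggregate, visualize) can be illustrated with a minimal Python sketch. This is not the Archives Unleashed Toolkit API: the record layout, the function names, and the sample URLs are all hypothetical, chosen only to show how the stages can be kept separate and chained.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for pages read from a web archive.
RECORDS = [
    {"url": "http://example.org/a", "crawl_date": "20150803",
     "links": ["http://example.com/x", "http://example.net/y"]},
    {"url": "http://example.org/b", "crawl_date": "20150901",
     "links": ["http://example.com/z"]},
    {"url": "http://other.test/c", "crawl_date": "20150901",
     "links": ["http://example.com/x"]},
]


def filter_records(records, domain):
    """Filter: keep only pages crawled from the given domain."""
    return [r for r in records if urlparse(r["url"]).netloc == domain]


def extract_link_domains(records):
    """Extract: pull the target domain of every outbound link."""
    return [urlparse(link).netloc for r in records for link in r["links"]]


def aggregate(domains):
    """Aggregate: count how often each target domain appears."""
    return Counter(domains)


def visualize(counts):
    """Visualize: render the counts as plain-text bars."""
    return "\n".join(f"{d:<15} {'#' * n}" for d, n in counts.most_common())


counts = aggregate(extract_link_domains(filter_records(RECORDS, "example.org")))
print(visualize(counts))
```

Because each stage only consumes the output of the previous one, the intermediate results (for example, the aggregated counts) could be saved and handed to a scholar as a small dataset, which is the idea behind the "derivative products" the abstract describes.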

    An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter

    This work is licensed and made available under a Creative Commons Attribution 3.0 United States license. The article first appeared in Code4Lib Journal, issue 32, 2016-04-25; the original is available at http://journal.code4lib.org/articles/11358. This article examines the tools, approaches, collaboration, and findings of the Web Archives for Historical Research Group around the capture and analysis of about 4 million tweets during the 2015 Canadian Federal Election. We hope that national libraries and other heritage institutions will find our model useful as they consider how to capture, preserve, and analyze ongoing events using Twitter. While Twitter is not a representative sample of broader society – Pew research shows in their study of US users that it skews young, college-educated, and affluent (above $50,000 household income) – Twitter still represents an exponential increase in the amount of information generated, retained, and preserved from 'everyday' people. Therefore, when historians study the 2015 federal election, Twitter will be a prime source. On August 3, 2015, the team initiated both a Search API and Stream API collection with twarc, a tool developed by Ed Summers, using the hashtag #elxn42. The hashtag referred to the election being Canada's 42nd general federal election (hence 'election 42' or elxn42). Data collection ceased on November 5, 2015, the day after Justin Trudeau was sworn in as the 23rd Prime Minister of Canada. We collected for a total of 102 days, 13 hours and 50 minutes. To analyze the data set, we took advantage of a number of command-line tools, utilities that are available within twarc, twarc-report, and jq. In accordance with the Twitter Developer Agreement & Policy, and after ethical deliberations discussed below, we made the tweet IDs and other derivative data available in a data repository. This allows other people to use our dataset, cite our dataset, and enhance their own research projects by drawing on #elxn42 tweets.
Our analytics included: breaking tweet text down by day to track change over time; client analysis, allowing us to see how the scale of mobile devices affected medium interactions; URL analysis, comparing both to Archive-It collections and the Wayback Availability API to add to our understanding of crawl completeness; and image analysis, using an archive of extracted images. Our article introduces our collecting work, ethical considerations, and the analysis we have done, and provides a framework for other collecting institutions to do similar work with our off-the-shelf open-source tools. We conclude by ruminating about connecting Twitter archiving with a broader web archiving strategy. Social Sciences and Humanities Research Council of Canada || Insight Grant (435-2015-0011)
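As a rough illustration of the "by day" breakdown mentioned above, the following Python sketch counts tweets per calendar day from line-oriented tweet JSON (one tweet per line, which is how twarc writes its output). It is not the authors' twarc, twarc-report, or jq workflow. It assumes the classic Twitter API v1.1 `created_at` format, and the sample tweets are invented for the example.

```python
import json
from collections import Counter
from datetime import datetime

# Assumed format of the `created_at` field in Twitter API v1.1 tweet JSON.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_day(lines):
    """Count tweets per calendar day from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[created.date().isoformat()] += 1
    return counts


# Invented sample data; real input would be a file written by the collector.
SAMPLE = [
    '{"id_str": "1", "created_at": "Mon Aug 03 14:05:00 +0000 2015", "text": "#elxn42 begins"}',
    '{"id_str": "2", "created_at": "Mon Aug 03 23:59:59 +0000 2015", "text": "#elxn42"}',
    '{"id_str": "3", "created_at": "Tue Aug 04 00:00:01 +0000 2015", "text": "#elxn42 day two"}',
]

daily = tweets_per_day(SAMPLE)
for day in sorted(daily):
    print(day, daily[day])
```

To run this over a collected file, pass an open file handle instead of `SAMPLE`, for example `tweets_per_day(open("tweets.jsonl", encoding="utf-8"))`, where the file name is a placeholder.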

    See a little Warclight: building an open-source web archive portal with project blacklight

    In 2014-15, due to close collaboration between UK-based researchers and the UK Web Archive, the open-source Shine project was launched. It allowed faceted search, trend diagram exploration, and other advanced methods of exploring web archives. It had two limitations, however: it was based on the Play framework (which is relatively obscure especially within library settings) and after the Big UK Domain Data for the Arts and Humanities (BUDDAH) project came to an end, development largely languished. The idea of Shine is an important one, however, and our project team wanted to explore how we could take this great work and begin to move it into the wider, open-source library community. Hence the idea of a Project Blacklight-based engine for exploring web archives. Blacklight, an open-source library discovery engine, would be familiar to library IT managers and other technical community members. But what if Blacklight could work with WARCs? The Archives Unleashed team’s first foray towards what we now call “Warclight” — a portmanteau of Blacklight and the ISO-standardized Web ARChive file format — was building a standalone Blacklight Rails application. As we began to realize this doesn’t help those who would like to implement it, development pivoted to building a Rails Engine which, “allows you to wrap a specific Rails application or subset of functionality and share it with other applications or within a larger packaged application.” Put another way, it allows others to use an existing Warclight template to build their own web archive search application. Drawing inspiration from UKWA’s Shine, it allows faceted full-text search, record view, and other advanced discovery options. Warclight is designed to work with web archive data that is indexed via the UK Web Archive’s webarchive-discovery project. Webarchive-discovery is a utility to parse ARCs and WARCs, and index them using Apache Solr, an open source search platform. 
Once these ARCs and WARCs have been indexed into Solr, we have searchable fields including title, host, crawl date, and content type. One of the biggest strengths of Warclight is that it is based on Blacklight. This opens up a mature open source community, which could allow us to go farther if we’re following the old idiom: “If you want to go fast, go alone. If you want to go further, go together.” This presentation will provide an overview of Warclight and its implementation patterns, including the Archives Unleashed at-scale implementation of over 1 billion Solr docs using Apache SolrCloud. This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.
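To make the faceted full-text search described above more concrete, here is a minimal Python sketch that builds a request URL for Solr's standard select handler. The `q`, `rows`, `wt`, `facet`, and `facet.field` parameters are standard Solr query parameters. The server address, the core name `webarchive`, and the field names `host` and `content_type` are assumptions made for illustration and should be checked against the schema of the actual index. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port, and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/webarchive/select"


def build_faceted_query(text, facet_fields, rows=10):
    """Build a Solr select URL for a full-text query with field facets."""
    params = [
        ("q", text),        # full-text query string
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # response format
        ("facet", "true"),  # enable faceting
    ]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Field names here are assumed, not taken from a confirmed schema.
url = build_faceted_query("election", ["host", "content_type"])
print(url)
```

A discovery layer such as Blacklight issues requests of this general shape on the user's behalf and renders the returned facet counts as clickable filters.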

    Engaging the Public with Web Archives: Providing Access to 10 Years of Political History with WebArchives.ca

    The Canadian Society of Digital Humanities/Société canadienne des humanités numériques Conference 2016. Introduction: The growth of digital sources since the advent of the World Wide Web in 1990-91 presents profound opportunities for historians. Large web archives contain billions of webpages, and now make it possible for us to develop large-scale reconstructions of the recent web. Yet the sheer number of these sources presents significant challenges. The Internet Archive's "Wayback Machine" (http://archive.org/web) is a standard entryway to these collections, but requires that the user know the URL of the resource they want to visit; it is not feasible to do large-scale research in this manner. By unlocking the Wayback Machine's underlying WebARCHive (ARC/WARC) files, we can develop methods to track, visualize, and analyze change occurring over time. In this paper, we discuss how we implemented the United Kingdom Web Archive (UKWA) "Shine" interface on a Canadian corpus, and how the provision of a user layer significantly changed levels of user engagement. Project Rationale and Case Study: The University of Toronto Library (UTL) began a quarterly crawl of Canadian political parties and political interest groups in 2005. It includes fifty websites: major and minor political parties, as well as political interest groups such as the Assembly of First Nations and equal marriage advocacy groups. Collecting continues. Despite 2005-2015 having been a pivotal period for Canadian politics, analytics reveal that few took advantage of the collection. The current portal requires a visit to https://archive-it.org/collections/227 for full-text queries. There is no faceting, and there are no significant advanced search features. The interface is largely unusable for broad research questions. Shine: To provide access, we implemented the Shine interface (https://github.com/ukwa/shine). Shine provides a web-based interface for interacting with Apache Solr.
Using the open-sourced code, we indexed all of the sites, provided explanatory layers, generated additional analytics around what each crawl contained (as some crawls might contain more webpages from, say, the Liberals, which throws off the relative frequency of keywords), and tried to write better user documentation. We launched http://webarchives.ca as the 2015 Canadian federal election campaign began. Results: WebArchives.ca received significant attention. The Canadian Broadcasting Corporation (CBC) carried stories in Canada Votes, the Kitchener-Waterloo affiliate, Spark, as well as talk radio and campus news. We received 17,861 pageviews over 4,000 user sessions, largely between 27 August and 19 October. It also led to research findings, including: (1) unlike other forms of web content, political parties and interest groups do not archive material on their websites. This eases analysis due to fewer duplicates, but also shows why collecting is time critical; (2) political parties flip flop: the Conservatives accused the Liberals in 2005 of paying insufficient attention to murdered and missing indigenous women; a complete reversal occurred on the 2015 websites; (3) significant shifts away from user-generated content on party sites, which experimented with and then abandoned widespread commenting and hosting of blogs. These were discoverable due to the Shine/Webarchives.ca interface. Conclusions: More work needs to be done. The next step is to work with more Archive-It collections of national/international significance and publicize them in a similar way. At the end of the presentation, I will note an ongoing project we have with Canadian partners to consolidate and provide access to multiple collections. This research was supported by a research grant -- 435-2015-0011 -- issued by Social Sciences and Humanities Research Council.
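The remark above about uneven crawl sizes throwing off keyword frequencies can be illustrated with a small Python sketch that divides raw keyword hits by the number of pages in each crawl. This is an assumed, simplified normalization rather than the project's actual analytics, and the crawl labels and counts are made up for the example.

```python
def relative_frequency(hits_by_crawl, pages_by_crawl):
    """Return keyword hits per page for each crawl with at least one page."""
    return {
        crawl: hits_by_crawl.get(crawl, 0) / pages
        for crawl, pages in pages_by_crawl.items()
        if pages > 0
    }


# Made-up numbers: the later crawl is four times larger than the earlier one.
pages = {"2005-Q4": 1000, "2015-Q3": 4000}
hits = {"2005-Q4": 50, "2015-Q3": 100}

rel = relative_frequency(hits, pages)
for crawl in sorted(rel):
    print(crawl, f"{rel[crawl]:.3f}")
```

In this invented example the raw hit count doubles between the two crawls, yet the per-page rate halves, which is why counts need to be read against the size of each crawl.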

    Islandora and Fedora 4; The Atonement.

    Open Repositories 2015. In the context of repository platforms, Islandora has a fair bit of age, and with that a fair bit of cruft. In the early winter of 2014/2015 the Islandora community began working on a project plan to outline what would be needed in a version of Islandora that would work with Fedora 4, and what resources would be needed to build it. This presentation will provide an overview of the project, as well as an in-depth technical overview and demonstration of the new functionality.

    Digital Preservation at York University Libraries

    York University Libraries are ten years into a digital preservation program. How did it start, how did it evolve, what do our policy and documentation look like, and what are the lessons learned? Library organizations are unique, but there is generally a fair bit of overlap where our path, policies, and documentation can be of use to other organizations. Council of Atlantic Academic Libraries

    Warclight: A Rails Engine for Web Archive Discovery

    This paper describes the development of Warclight, a portmanteau of the open-source Blacklight platform and the ISO-standard Web ARChive file format. Warclight allows users to explore web archives that have been indexed into Apache Solr using the UK Web Archive's Web Archive Discovery tool. Referencing previous work, we explain how the standard search engine results page is inadequate to support scholarly inquiries. Instead, Warclight provides full-text and faceted search, as well as faceted browsing, to enable exploration and discovery. Given the large sizes of many web archives, we share experiences with deploying our tool at scale using a federated architecture. This work was primarily supported by the Andrew W. Mellon Foundation and Compute Canada's Research Platforms and Portals program. Additional funding for the project has come from Start Smart Labs and the Social Sciences and Humanities Research Council of Canada.

    Content Selection and Curation for Web Archiving: The Gatekeepers vs. the Masses

    Any preservation effort must begin with an assessment of what content to preserve, and web archiving is no different. There have historically been two answers to the question "what should we archive?" The Internet Archive's broad entire-web crawls have been supplemented by narrower domain or topic-specific collections gathered by numerous libraries. We can characterize this as content selection and curation by "gatekeepers". In contrast, we have witnessed the emergence of another approach driven by "the masses": we can archive pages that are contained in social media streams such as Twitter. The interesting question, of course, is how these approaches differ. We provide an answer to this question in the context of a case study about the 2015 Canadian federal elections. Based on our analysis, we recommend a hybrid approach that combines an effort driven by social media and more traditional curatorial methods. This research was supported by a research grant -- 435-2015-0011 -- issued by Social Sciences and Humanities Research Council.